Coding sequence density estimation via topological pressure.
نویسندگان
چکیده
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus Musculus, Rhesus Macaque and Drososphilia Melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/ .
منابع مشابه
J an 2 01 4 CODING SEQUENCE DENSITY ESTIMATION VIA TOPOLOGICAL PRESSURE
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well known concept in ergodic theory. Topological pressure measures the ‘weighted information content’ of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters...
متن کاملWavelet Based Estimation of the Derivatives of a Density for m-Dependent Random Variables
Here, we propose a method of estimation of the derivatives of probability density based wavelets methods for a sequence of m−dependent random variables with a common one-dimensional probability density function and obtain an upper bound on Lp-losses for the such estimators.
متن کاملWavelet Based Estimation of the Derivatives of a Density for a Discrete-Time Stochastic Process: Lp-Losses
We propose a method of estimation of the derivatives of probability density based on wavelets methods for a sequence of random variables with a common one-dimensional probability density function and obtain an upper bound on Lp-losses for such estimators. We suppose that the process is strongly mixing and we show that the rate of convergence essentially depends on the behavior of a special quad...
متن کاملEstimation of protein coding density in a corpus of DNA sequence data.
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic method...
متن کاملStatistical Topology Using the Nonparametric Density Estimation and Bootstrap Algorithm
This paper presents approximate confidence intervals for each function of parameters in a Banach space based on a bootstrap algorithm. We apply kernel density approach to estimate the persistence landscape. In addition, we evaluate the quality distribution function estimator of random variables using integrated mean square error (IMSE). The results of simulation studies show a significant impro...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Journal of mathematical biology
دوره 70 1-2 شماره
صفحات -
تاریخ انتشار 2015